 modality heterogeneity


Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media

arXiv.org Artificial Intelligence

Semantic location prediction aims to derive meaningful location insights from multimodal social media posts, offering a more contextual understanding of daily activities than using GPS coordinates. This task faces significant challenges due to the noise and modality heterogeneity in "text-image" posts. Existing methods are generally constrained by inadequate feature representations and modal interaction, struggling to effectively reduce noise and modality heterogeneity. To address these challenges, we propose a Similarity-Guided Multimodal Fusion Transformer (SG-MFT) for predicting the semantic locations of users from their multimodal posts. First, we incorporate high-quality text and image representations by utilizing a pre-trained large vision-language model. Then, we devise a Similarity-Guided Interaction Module (SIM) to alleviate modality heterogeneity and noise interference by incorporating both coarse-grained and fine-grained similarity guidance for improving modality interactions. Specifically, we propose a novel similarity-aware feature interpolation attention mechanism at the coarse-grained level, leveraging modality-wise similarity to mitigate heterogeneity and reduce noise within each modality. At the fine-grained level, we utilize a similarity-aware feed-forward block and element-wise similarity to further address the issue of modality heterogeneity. Finally, building upon pre-processed features with minimal noise and modal interference, we devise a Similarity-aware Fusion Module (SFM) to fuse two modalities with a cross-attention mechanism. Comprehensive experimental results clearly demonstrate the superior performance of our proposed method.
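The coarse-grained, modality-wise similarity guidance mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration: each modality is reduced to a single pooled vector, similarity is measured with cosine similarity, and the hypothetical function `similarity_guided_interpolation` pulls each modality toward the other only as far as the two agree. It is not the SG-MFT attention mechanism itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_guided_interpolation(text_vec, image_vec, max_mix=0.5):
    """Hypothetical coarse-grained interpolation (an assumption, not the paper's rule).

    The mixing weight grows with the modality-wise similarity: when text and
    image agree, each vector is pulled toward the other; when they disagree
    (similarity <= 0), each vector is left unchanged so that a noisy modality
    does not contaminate the other one.
    """
    sim = cosine_similarity(text_vec, image_vec)
    alpha = max_mix * max(sim, 0.0)  # mixing weight in [0, max_mix]
    mixed_text = [(1 - alpha) * t + alpha * i for t, i in zip(text_vec, image_vec)]
    mixed_image = [(1 - alpha) * i + alpha * t for t, i in zip(text_vec, image_vec)]
    return mixed_text, mixed_image
```

With this choice, unrelated (orthogonal) text and image vectors pass through unchanged, while perfectly aligned ones are averaged. A real model would apply such weighting inside attention layers over token and patch features rather than over two pooled vectors.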


Balanced Multi-modal Federated Learning via Cross-Modal Infiltration

arXiv.org Artificial Intelligence

Federated learning (FL) underpins advancements in privacy-preserving distributed computing by collaboratively training neural networks without exposing clients' raw data. Current FL paradigms primarily focus on uni-modal data, while exploiting the knowledge from distributed multimodal data remains largely unexplored. Existing multimodal FL (MFL) solutions are mainly designed for statistical or modality heterogeneity on the input side, but they have yet to solve the fundamental issue of "modality imbalance" in distributed settings, which can lead to inadequate information exploitation and heterogeneous knowledge aggregation across modalities. In this paper, we propose a novel Cross-Modal Infiltration Federated Learning (FedCMI) framework that effectively alleviates modality imbalance and knowledge heterogeneity via knowledge transfer from the global dominant modality. To avoid losing information in the weak modality when it merely imitates the behavior of the dominant modality, we design a two-projector module that integrates knowledge from the dominant modality while still promoting local feature exploitation in the weak modality. In addition, we introduce a class-wise temperature adaptation scheme to achieve fair performance across different classes. Extensive experiments on popular datasets confirm the effectiveness of the proposed framework in fully exploiting the information of each modality in MFL.
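To make the idea of knowledge transfer with class-wise temperatures more concrete, here is a minimal sketch. The abstract does not specify the loss or how the temperatures are adapted, so the following is an assumption: a standard distillation-style KL divergence in which each class logit is divided by its own temperature before the softmax. The function names and the `temperatures` argument are hypothetical and are not taken from FedCMI.

```python
import math


def softmax_with_class_temperatures(logits, temperatures):
    """Softmax in which class c's logit is divided by its own temperature T_c."""
    scaled = [z / t for z, t in zip(logits, temperatures)]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(dominant_logits, weak_logits, temperatures):
    """Hypothetical transfer loss: the weak modality (student) is pushed toward
    the softened predictions of the dominant modality (teacher)."""
    teacher = softmax_with_class_temperatures(dominant_logits, temperatures)
    student = softmax_with_class_temperatures(weak_logits, temperatures)
    return kl_divergence(teacher, student)
```

In this sketch a larger temperature for a class flattens that class's contribution to the softened distribution. How the per-class temperatures would actually be adapted during training is left open here, since the abstract does not describe it.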